505 research outputs found

Feedback Driven Improvement of Data Preparation Pipelines

Author: Konstantinou Nikolaos
Paton Norman
Publication date: 01/01/2019

The University of Manchester - Institutional Repository

An analysis of extensible modelling for functional genomics data

Author: Jones Andrew R
Paton Norman W
Publication venue: BioMed Central
Publication date: 01/09/2005

BACKGROUND: Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time. RESULTS: In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure. CONCLUSION: From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Dataset Discovery and Exploration: A Survey

Author: Chen Jiaoyan
Paton Norman
Wu Zhenyu
Publication date: 04/10/2023

The University of Manchester - Institutional Repository

Source Selection Languages: A Usability Evaluation

Author: Abel Edward
Galpin Ixent
Paton Norman W.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2018

Crossref

The University of Manchester - Institutional Repository

Deep Clustering for Data Cleaning and Integration

Author: Freitas Andre
Paton Norman W.
Rauf Hafiz Tayyab
Publication date: 22/09/2023

Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in data cleaning and integration tasks, specifically schema inference, entity resolution, and domain discovery, tasks that represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we observed a significant correlation between the DC method and embedding approaches for rows, columns, and tables, highlighting that the suitable combination can enhance the efficiency of DC methods.

Comment: The following enhancements have been carried out in the updated version of the manuscript: * Evaluated each data integration problem on additional datasets. * Added more DC and SC methods to the evaluation. * Discussed algorithmic-specific observation

arXiv.org e-Print Archive

A Critical and Integrated View of the Yeast Interactome

Author: Cornell Michael
Oliver Stephen G.
Paton Norman W.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2004

Global studies of protein–protein interactions are crucial to both elucidating gene function and producing an integrated view of the workings of living cells. High-throughput studies of the yeast interactome have been performed using both genetic and biochemical screens. Despite their size, the overlap between these experimental datasets is very limited. This could be due to each approach sampling only a small fraction of the total interactome. Alternatively, a large proportion of the data from these screens may represent false-positive interactions. We have used the Genome Information Management System (GIMS) to integrate interactome datasets with transcriptome and protein annotation data and have found significant evidence that the proportion of false-positive results is high. Not all high-throughput datasets are similarly contaminated, and the tandem affinity purification (TAP) approach appears to yield a high proportion of reliable interactions for which corroborating evidence is available. From our integrative analyses, we have generated a set of verified interactome data for yeast

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Dataset Discovery in Data Lakes

Author: Bogatu Alex
Fernandes Alvaro
Konstantinou Nikolaos
Paton Norman
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 27/05/2020

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times

arXiv.org e-Print Archive

Crossref

The University of Manchester - Institutional Repository

A hierarchical decentralized architecture to enable adaptive scalable virtual machine migration

Author: Hummaida Abdul R.
Paton Norman W.
Sakellariou Rizos
Publication venue: Wiley
Publication date: 25/01/2023

The University of Manchester - Institutional Repository

Deep Clustering for Data Cleaning and Integration

Author: Freitas Andre
Paton Norman W.
Rauf Hafiz Tayyab
Publication venue: OpenProceedings
Publication date: 21/12/2024

Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks

The University of Manchester - Institutional Repository

Voyager: Data Discovery and Integration for Data Science

Author: Bogatu Alex
Douthwaite Mark
Freitas Andre
Paton Norman W.
Publication date: 23/03/2022

The University of Manchester - Institutional Repository